JasonTam commented Jul 7, 2025

Description

Vertex AI offers training "Custom Jobs", which are somewhat more flexible than the baseline kfp component.
For example, this was the only way I found to set the number of replicas so that training is distributed across multiple workers at the Vertex level.

This PR introduces the config key distributed_training, which lets you enable the custom training job operator kfp component when you want to distribute training across more than one replica.

For now, the distribution settings (i.e., the number of replicas) are the same for all enabled tasks; you cannot set a different number of replicas per task.
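To make the intended behaviour concrete, here is a minimal sketch of how a single shared replica count could be turned into Vertex AI worker pool specs for each enabled task. It is not the code in this PR: the sub-keys `enabled_tasks`, `replica_count` and `machine_type`, the helper names, and the image URIs are all hypothetical, and the split into one chief replica plus N-1 workers is an assumption about how the replicas are laid out.

```python
from typing import Any, Optional


def build_worker_pool_specs(
    image_uri: str, replica_count: int, machine_type: str = "n1-standard-4"
) -> list[dict[str, Any]]:
    """Build CustomJob-style worker pool specs (hypothetical helper).

    Assumption: pool 0 holds a single chief replica and pool 1 holds the
    remaining replica_count - 1 workers.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be >= 1")

    def pool(n: int) -> dict[str, Any]:
        return {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": n,
            "container_spec": {"image_uri": image_uri},
        }

    specs = [pool(1)]  # chief
    if replica_count > 1:
        specs.append(pool(replica_count - 1))  # workers
    return specs


def plan_tasks(
    config: dict[str, Any], task_images: dict[str, str]
) -> dict[str, Optional[list[dict[str, Any]]]]:
    """Map each task to worker pool specs, or None for the standard component.

    The same replica settings apply to every enabled task, mirroring the
    limitation described above. Sub-key names are illustrative only.
    """
    dist = config.get("distributed_training", {})
    enabled = set(dist.get("enabled_tasks", []))
    replicas = int(dist.get("replica_count", 1))
    machine_type = dist.get("machine_type", "n1-standard-4")

    plan: dict[str, Optional[list[dict[str, Any]]]] = {}
    for task, image in task_images.items():
        if task in enabled and replicas > 1:
            plan[task] = build_worker_pool_specs(image, replicas, machine_type)
        else:
            plan[task] = None  # keep the standard kfp component
    return plan


if __name__ == "__main__":
    config = {"distributed_training": {"enabled_tasks": ["train"], "replica_count": 4}}
    images = {"train": "example/train:latest", "evaluate": "example/eval:latest"}
    for task, specs in plan_tasks(config, images).items():
        print(task, specs)
```

In this sketch a task that is not enabled, or a replica count of 1, falls back to the standard component, so existing pipelines would be unaffected unless the key is set.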

I admit, I wish Vertex AI / Kubeflow allowed replicas to be set on the standard components; that would make things simpler.
Looking for feedback and thoughts! Thanks.
